<html>
<title> Performance Hints for Signal-Processing Applications </title>
<body>
<h2> Performance Hints for Signal-Processing Applications </h2>

<p>
MIPS processors, particularly the newer ones, can execute
a substantial amount of real-time signal-processing code. But as with
all processors, writing efficient code requires an understanding
of the processor and compiler architectures. This list is intended to
give you some things to think about.

<p>
First, my <b> disclaimer. </b>
I've based this list upon a number of "gotchas" that I and
other audio & image-processing developers have hit. This is by
no means a complete list, but it hopefully has some useful hints. It's
a new list, and I'm constantly refining it and adding things. If you
have your own "gotchas" or suggestions, please send them in!
<p>
Thanks to Bruce Karsh, Gints Klimanis, Chris Pirazzi, and Bryan James
for their review and contributions.
<p>
<a href="http://reality.sgi.com/employees/cook/"> Doug Cook </a><br>
<a href="mailto:cook@sgi.com"> (cook@sgi.com) </a>
<hr>
1. <b> There are few "magic bullets", only guidelines and trade-offs. </b> What
speeds up one app may slow down another if the applications are architected
differently. Take the time to understand and profile your application to
confirm whether these guidelines actually help.

<hr>
2. <b> Consider the data type you need for each operation. </b> Don't make the
assumption that integer types are faster than floating-point. This is
often not true. Here are some gross generalizations about what's best
for different types of operations:
<pre>
Bit-Swizzling & Logical Ops     Integer only
Addition                        Integer faster than FP
General Multiplication          FP faster than integer
Multiplication by a Constant    Integer is often faster if you can
                                write it as a small number of shifts
                                & adds. Otherwise FP is faster.
Division                        FP is faster. But if you can replace
                                division by multiplication, do so;
                                multiplication is faster.
Conditionals                    Integer is faster
</pre>
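<p>
To make the shifts-and-adds entry concrete, here's a small sketch in C
(the name "mul10" is just for illustration): x*10 decomposes into
x*8 + x*2, which is two shifts and an add instead of a general multiply.

```c
/* x*10 rewritten as (x<<3) + (x<<1), i.e. x*8 + x*2: two shifts
   and one add instead of a general integer multiply. */
static int mul10(int x)
{
    return (x << 3) + (x << 1);
}
```

Whether this beats the hardware multiplier depends on the processor, so
disassemble and measure before committing to it.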

Another thing to consider is that the FP unit has its own register set.
When you use FP, you effectively get a much larger register set to use, because
your logical code uses the integer registers, whereas the computations use
the FP registers. With more registers, some loops may require fewer
load and store operations.
<p>

In general, for the typical signal-processing application, which
requires a lot of multiply/adds of variables, floating-point offers superior
performance. But as always, it depends upon the architecture of the
application; your mileage may vary.
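<p>
The division entry above is worth a sketch too. If every sample in a
loop is divided by the same value, you can do one divide outside the
loop and multiply inside it (function names here are illustrative):

```c
/* One divide hoisted out of the loop; each sample then needs only
   a multiply, which is cheaper than a divide. */
static void scale_buffer(float *buf, int n, float divisor)
{
    int i;
    float recip = 1.0f / divisor;   /* the single divide, done once */
    for (i = 0; i < n; i++)
        buf[i] *= recip;
}

static float scale_demo(void)
{
    float buf[4] = { 2.0f, 4.0f, 6.0f, 8.0f }, sum = 0.0f;
    int i;
    scale_buffer(buf, 4, 2.0f);     /* halve every sample */
    for (i = 0; i < 4; i++)
        sum += buf[i];
    return sum;                     /* 1 + 2 + 3 + 4 */
}
```

Note that the reciprocal trick can change the last bit of the result;
for audio processing that's usually fine, but check if exact rounding matters.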

<p>
2a. Another note: one common speed-up for integer operations is to use
table lookup where possible. This often works well for image processing,
where the input is 8- or 10-bit data and the tables are small enough to
stay in the cache. For audio operations, table lookup is less common, since
audio data is typically 16-24 bits.
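<p>
As a sketch of the table-lookup idea for 8-bit image data: the table below
approximates a gamma-2.0 transfer curve with integer math only (the curve
and names are illustrative, not from any particular library). The 256-byte
table easily stays in the primary cache, and the per-pixel work becomes a
single indexed load.

```c
#include <stddef.h>

/* 256-entry lookup table for a rough gamma-2.0 curve, built with
   integer math only.  256 bytes fits easily in the primary cache. */
static unsigned char gamma2_lut[256];

static void build_gamma2_lut(void)
{
    int i;
    for (i = 0; i < 256; i++)
        gamma2_lut[i] = (unsigned char)((i * i + 127) / 255);
}

/* The per-pixel computation becomes one indexed load. */
static void apply_lut(unsigned char *pix, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        pix[i] = gamma2_lut[pix[i]];
}

static int lut_demo(void)
{
    unsigned char pix[3] = { 0, 128, 255 };
    build_gamma2_lut();
    apply_lut(pix, 3);
    return pix[0] + pix[2];   /* the endpoints map to 0 and 255 */
}
```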

<hr>
3. <b> Avoid unnecessary type conversion. </b> This is a trade-off with #2.
Conversion between integer and floating-point can be expensive
(particularly FP->int, which sometimes requires limiting). If your
application accepts and produces integer data and only does a small
amount of signal processing, the cost of conversion to FP may outweigh
the gains in performance from using FP.

<p>
Also, be aware: in some cases, the compiler may do type conversion
"behind your back," particularly by promoting single-precision to
double-precision floating-point (hint #7).

<hr>
4. <b> Consider the instruction set (ISA) you want to use. </b>
If you're writing in C, selecting an instruction set requires just a
compiler flag, e.g. -mips2. There are currently 4 MIPS instruction
sets (MIPS 1-4). Each set is a superset of the previous. Roughly
speaking, here's how they work:
<pre>
MIPS 1    R3000
MIPS 2    R4000 instructions
MIPS 3    R4000 -- introduces 64-bit operations
MIPS 4    R8000/R10000 instructions,
          notably floating-point multiply/add, conditional FP move
</pre>

<p>
I've noticed that one big benefit of going from MIPS1 to MIPS2 is a
reduction in the cost of FP->integer conversion. For MIPS1, the
compiler will generate some inline code to perform this; for MIPS2, that
code is replaced by a single instruction. In general, though, you
should not expect a huge performance gain going mips1->mips2.

<p>
The MIPS4 instruction set provides some nice opportunities for
signal-processing applications. It provides an FP multiply/add
instruction and floating-point conditional-move instructions. The
latter let you write branch-free limiting code, which is a big
win in some cases (see #5).

<p>
With each instruction set, you may get a performance gain at the cost
of compatibility with older MIPS-architecture processors. In other
words, if you compile -mips2, you may see some speedup, but you lose
R3000 compatibility. This is a trade-off you will have to consider.
Some people maintain multiple versions of a binary using different
ISA's. (If your application uses "inst" for installation, you can have
"inst" automatically select the appropriate binary for the end-user's
machine.)

<hr>
4a. <b> Disassemble your code and look at it. </b> Use the "dis" command on your
actual binary, not the code generated by "cc -S" (the
compiler lies, often substituting higher-level macros for more complex
things that actually occur in the binary itself).

<p>
There's really no substitute for this. Looking at the source code
and thinking about the algorithms at a high level is a first step, but
in the end, if you need to squeeze every bit of performance out of
your code, you need to examine the actual instructions that the CPU
will execute.

<hr>
5. <b> Avoid branches where possible in critical code. </b> If you can replace a
bit of frequently executed code with an equivalent branch-free piece of
code, it may run in fewer cycles even though it has more instructions. This is
because the MIPS processors are pipelined, and for some of them, a branch
requires several extra cycles to refill the pipeline.

<p>
Another reason to avoid branches is that many of the compiler's optimizations
are limited to basic blocks, i.e. the region between two branches. Removing
branches often gives the compiler a better chance to optimize the code.
Finally, branches inside loops reduce the effectiveness of loop unrolling.

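<p>
A common case of branch-free limiting is clamping an audio intermediate
to the 16-bit sample range. Written as conditional expressions like this,
a MIPS4 compiler can use conditional moves rather than branches (the
function name is just for illustration):

```c
/* Saturate a 32-bit intermediate result to the 16-bit sample range.
   The conditional expressions can compile to conditional moves on
   MIPS4 -- no branches in the hot path. */
static short clamp16(int x)
{
    x = x > 32767 ? 32767 : x;
    x = x < -32768 ? -32768 : x;
    return (short)x;
}
```

Disassemble the result to confirm your compiler actually emits conditional
moves; on older ISAs it will still branch.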
<hr>
6. <b> Work with the cache. </b> Though the processor is very fast, the memory
system is relatively slow. If you don't get good cache performance,
your application will suck wind. A good rule of thumb is the
order-of-magnitude rule. Think of each successive level of cache, going
from the processor out to main memory, as 10 times as slow as the previous
level. Suppose an instruction which hits the primary cache takes 1 cycle.
The same instruction may require 10 cycles if it misses the primary cache
and hits the secondary cache, and 100 cycles if it misses the secondary cache
and requires a cache-line fetch from main memory. The effect is more
pronounced as processors become faster. Cache effects can thus make a
huge difference in program performance.

<p>
Organize your code and your data to optimize your use of the cache.
Minimize your inner loops and working sets of data. For example, in
image processing, "tiling" is a common practice. Suppose a number of
consecutive operations are to be performed on an image. The "obvious" way
would be to perform each operation on the entire image, then proceed to
the next operation. However, assuming the image doesn't fit in the cache,
this gets the worst possible cache performance, since the image has to
be reloaded from main memory every time it is touched. A far superior
approach is to operate on small "tiles" which fit in the cache: perform
all the operations on one tile, then proceed to the next.

<p>
There are lots of other tricks to improve cache performance, but hopefully
this short example will get you thinking. If you need data in a critical
loop, try to make sure it's in the cache every time the loop is executed.

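<p>
Here's a toy sketch of the tiled loop structure, assuming the image
dimensions are multiples of the tile size (the sizes and the two
per-pixel "operations" are made up for illustration; real code would
pick TILE so a tile fits comfortably in the cache):

```c
#define W    64
#define H    64
#define TILE 16   /* illustrative; W and H are multiples of TILE here */

static unsigned char img[H][W];

/* Two consecutive per-pixel passes applied tile by tile, so each
   pixel is still cache-resident when the second pass touches it. */
static void process_tiled(void)
{
    int ty, tx, y, x;
    for (ty = 0; ty < H; ty += TILE)
        for (tx = 0; tx < W; tx += TILE)
            for (y = ty; y < ty + TILE; y++)
                for (x = tx; x < tx + TILE; x++) {
                    img[y][x] = (unsigned char)(img[y][x] + 10); /* pass 1 */
                    img[y][x] ^= 0xff;                           /* pass 2 */
                }
}

static int tile_demo(void)
{
    process_tiled();   /* img starts zeroed (static storage) */
    return img[5][7];  /* every pixel becomes (0+10)^0xff = 245 */
}
```

In real tiled code the passes are usually separate loops over one tile,
but the access pattern -- finish all work on a tile before moving on --
is the point.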
<hr>
7. <b> Watch out for floating-point promotion. </b> Double-precision arithmetic
is typically more costly than single-precision, so if you don't need
it, make sure you're not using it. In some cases the compiler will
automatically promote operands to double-precision "behind your back."
Here are some things to look for. Check out the "-float" option on the
C compiler, which instructs the compiler (in K&R mode) to avoid
promoting operands to double-precision. It's also a very good idea to
use function prototypes with functions taking "float" arguments --
otherwise the arguments will be promoted to double-precision (even with
the -float option). Finally, if a floating-point constant is to be
single-precision, it should be given in the form "0.0f", i.e.
explicitly single-precision, to avoid promotion to double-precision.

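<p>
Both points in one small sketch (the function itself is illustrative):

```c
/* With a prototype in scope, a "float" argument is passed as a float;
   under K&R rules (no prototype) it would be promoted to double at
   every call.  The "f" suffix keeps the constant single-precision. */
static float half(float x)   /* prototyped: x is not promoted */
{
    return x * 0.5f;         /* 0.5f, not 0.5: no double arithmetic */
}
```

Without the suffix, "x * 0.5" forces x up to double for the multiply
and the result back down to float -- two conversions plus a double
multiply where one single-precision multiply would do.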
<hr>
8. <b> Don't assume that assembly language is inherently better than
C code. </b> Write your algorithm in C unless you find by disassembly
that the compiler is missing some possible optimizations, or
unless there's some special trick you need that can really only
be expressed in assembly. In most cases you'll find that the compiler
does a really good job.

<hr>
9. <b> Don't assume that pointer arithmetic is inherently better than
array indexing. </b> The normal addressing mode of the MIPS processor (a fixed
offset from a register) lends itself nicely to array indexing. Consider
the following two loops:
<pre>
    for (i = 0; i < 50; i++) {
        array[i] = 0;
    }

    int *p = array;
    for (i = 0; i < 50; i++) {
        *p++ = 0;
    }
</pre>
<p>
Both are functionally equivalent if the value of p is not used later
in the second case. Some people shy away from the first approach in
the mistaken belief that the address generation for the array indexing
is computationally expensive. In fact, the compiler generates code for
the first version which has 20% fewer instructions per iteration than the
code for the second version. Why? The compiler often generates better code
for simpler expressions, because it can better understand what the code
is trying to accomplish. In the first case, the compiler sees that the code
is looping on an array index. It calculates the end address of the array
(200 from the base, since it is an integer array), and generates a loop which
has only one variable -- the address in the array. "i" as such is nowhere to
be seen. The loop requires four instructions per iteration:
<pre>
[test1.c: 5] 0x4009a8:  248500c8   addiu   a1,a0,200
[test1.c: 5] 0x4009ac:  24840004   addiu   a0,a0,4
[test1.c: 5] 0x4009b0:  0085082b   sltu    at,a0,a1
[test1.c: 5] 0x4009b4:  1420fffd   bne     at,zero,0x4009ac
[test1.c: 6] 0x4009b8:  ac80fffc   sw      zero,-4(a0)
</pre>
<p>
The second example is too "smart" for the compiler.
The compiler can't quite figure out what the programmer is trying
to accomplish, so it generates two loop variables, one for "i" and
one for "p" (even though "p" is never used after the loop). This
example requires 5 instructions per iteration, since it has to
increment two loop variables.
<pre>
[test2.c: 4] 0x4009a8:  24020032   li      v0,50
[test2.c: 6] 0x4009ac:  00002025   move    a0,zero
[test2.c: 6] 0x4009b0:  24840001   addiu   a0,a0,1
[test2.c: 6] 0x4009b4:  0082082a   slt     at,a0,v0
[test2.c: 7] 0x4009b8:  ac600000   sw      zero,0(v1)
[test2.c: 6] 0x4009bc:  1420fffc   bne     at,zero,0x4009b0
[test2.c: 7] 0x4009c0:  24630004   addiu   v1,v1,4
</pre>
<p>
(If you're not familiar with MIPS assembly, note that the instruction
after the branch [bne] is executed before the branch actually occurs.
This is the so-called "branch delay slot".)
<p>
Some folks have tested numerous ways of stating loops to see for
which the compiler generates the fastest code. This is an interesting
experiment, but the results may not hold true from compiler release to
compiler release. However, the general principle holds: don't be afraid
to state your loops in simple terms. Often the compiler does best with
the simplest constructs.
<hr>
10. <b> Watch out for expressions hidden in macros. </b> Know what's a macro
and what's a function. Expressions hidden in macros will sometimes
be evaluated multiple times; this causes a performance hit if it's
not required for correct evaluation of the macro.
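<p>
The classic example (SQUARE is a hypothetical macro, not from any
system header): the macro's argument appears twice in the expansion,
so whatever computation it contains runs twice.

```c
/* The argument appears twice in the expansion, so side effects --
   and expensive computations -- in the argument happen twice. */
#define SQUARE(x) ((x) * (x))

static int calls;
static int next(void) { return ++calls; }

static int square_demo(void)
{
    calls = 0;
    return SQUARE(next());   /* next() runs twice: result is 1 * 2 */
}
```

If the argument is an expensive function call, that's a silent doubling
of the work; if it has side effects, it's a correctness bug as well.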
<hr>
11. <b> Consider unrolling your loops. </b> Often the compiler will do this
for you, but you can sometimes glean a little more performance by doing
it yourself. Loop unrolling works as follows. Consider the following loop:
<pre>
    for (i = 0; i < x; i++) {
        array[i] = 0;
    }
</pre>
<p>
We've seen above (#9) that the compiler generates somewhat smart code
for this. For each iteration of the loop, the pointer is incremented
and tested against its end value. But we can cut down on the number of
instructions per array item if we write the loop like this:
<pre>
    for (i = 0; i < x; i += 4) {
        array[i] = 0;
        array[i+1] = 0;
        array[i+2] = 0;
        array[i+3] = 0;
    }
</pre>
<p>
This loop is faster than the first one. The trick is that the pointer
increment and test-against-end-value need only be done once for every 4
array elements, instead of for every array element. There's no extra
computation overhead associated with the (i+n) indices, because these
make use of the MIPS addressing mode: the compiler can encode the
fixed offset (represented by n) directly into each store, relative to
the address represented by i. Also, since there are fewer iterations,
there are fewer branches.
As noted above, a pipelined processor likes to execute instructions
linearly (#5). The compiler-generated code looks like:
<pre>
[test3.c: 6] 0x4009a8:  248500c8   addiu   a1,a0,200
[test3.c: 6] 0x4009ac:  24840010   addiu   a0,a0,16
[test3.c: 6] 0x4009b0:  0085082b   sltu    at,a0,a1
[test3.c: 7] 0x4009b4:  ac80fff0   sw      zero,-16(a0)
[test3.c: 8] 0x4009b8:  ac80fff4   sw      zero,-12(a0)
[test3.c: 9] 0x4009bc:  ac80fff8   sw      zero,-8(a0)
[test3.c: 6] 0x4009c0:  1420fffa   bne     at,zero,0x4009ac
[test3.c: 10] 0x4009c4: ac80fffc   sw      zero,-4(a0)
</pre>
<p>
Note that this snippet assumes that x is divisible by 4, unlike the
original loop. There's a way around this for general values of x:
split the loop into two loops. The first is not unrolled, and processes
(x & 3) elements. The remaining number of elements is guaranteed to be
divisible by 4, and the second, unrolled loop handles the bulk of the
elements. If x is going to be very small, it's not worth the overhead,
but for most values of x, this approach is a big win.
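<p>
The two-loop split looks like this (a sketch; the function name is
illustrative):

```c
/* Unrolled-by-4 clear that works for any x: a short scalar loop
   first handles the x % 4 leftover elements, then the unrolled loop
   takes the rest four at a time. */
static void zero_array(int *array, int x)
{
    int i = 0;
    for (; i < (x & 3); i++)      /* 0..3 leftover elements */
        array[i] = 0;
    for (; i < x; i += 4) {       /* remaining count divisible by 4 */
        array[i]   = 0;
        array[i+1] = 0;
        array[i+2] = 0;
        array[i+3] = 0;
    }
}

static int zero_demo(void)
{
    int a[7] = { 1, 2, 3, 4, 5, 6, 7 }, sum = 0, i;
    zero_array(a, 7);
    for (i = 0; i < 7; i++)
        sum += a[i];
    return sum;                   /* every element cleared */
}
```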
<p>
So why did we choose 4? First, it's a power of two, which makes the
modulo operation cheap (x & 3, as seen above). It's also fairly small.
As the loop-unrolling block size becomes larger, you get diminishing
returns. At some point, things actually slow down because the code
no longer gets nice cache behavior. Common values for the block size
are 4, 8, and sometimes 16.
<p>
Experiment with your code to see what works well. As always, if it's
a critical loop, disassemble it to see what the compiler is doing for
you.
<hr>
12. <b> Watch out for floating-point exceptions. </b> These can really hurt
your performance. Check out the <a href="fp.underflow"> excellent little blurb </a> that Chris
Pirazzi has written, available from the audio apps page.
<p>
[...more to come...]
</body>
</html>